1 Introduction

Bayesian ANalysis of Differential Localisation Experiments (BANDLE) is an integrative semi-supervised functional mixture model, developed by (Crook et al. 2021), to obtain the probability of a protein being differentially localised between two conditions.

In this vignette we walk users through how to install and use the R (R Development Core Team 2011) Bioconductor (Gentleman et al. 2004) bandle package by simulating a well-defined differential localisation experiment from spatial proteomics data from the pRolocdata package (Gatto et al. 2014).

The BANDLE method uses posterior Bayesian computations performed using Markov-chain Monte-Carlo (MCMC) and thus uncertainty estimates are available (Gilks, Richardson, and Spiegelhalter 1995). Throughout this vignette we use the term differentially localised to pertain to proteins which are assigned to different sub-cellular localisations between two conditions. One of the main outputs of BANDLE is the probability that a protein is differentially localised between two conditions.

2 The data

In this vignette and (Crook et al. 2021), the main data source that we use to study differential protein sub-cellular localisation are data from high-throughput mass spectrometry-based experiments. The data from these types of experiments traditionally yield a matrix of measurements wherein we have, for example, PSMs, peptides or proteins along the rows, and samples/channels/fractions along the columns. The bandle package uses the MSnSet class as implemented in the Bioconductor MSnbase package and thus requires users to import and store their data as a MSnSet instance. For more details on how to create a MSnSet see the relevant vignettes in pRoloc. There is also additional information and examples in the pRoloc sister package. The pRolocdata experiment data package is a good starting place to look for test data. This data package contains tens of quantitative proteomics experiments, stored as MSnSets.

2.1 A well-defined theoretical example

We are going to generate a differential localisation experiment. They key elements are replicates, and a perturbation of interest. There is code within the bandle package to simulate an example experiment.

In the code chunk below we begin by loading the pRolocdata package to obtain a spatial proteomics dataset. This will be the basis of our simulation which will use boostrapping to generate new datasets. The dataset we have chosen to load is a dataset from 2009 (tan2009r1). This is data from a early LOPIT experiment performed on Drosophila embryos (Tan et al. 2009). The aim of this experiment was to apply LOPIT to an organism with heterogeneous cell types. This experiment used four isotopes across four distinct fractions and thus yielded four measurements (features) per protein profile.

library("pRolocdata")
data("tan2009r1")
plot2D(tan2009r1,
       main = "An example spatial proteomics datasets", 
       grid = FALSE)
addLegend(tan2009r1, where = "topleft", cex = 0.7, ncol = 2)

The following code chuck simulates a differential localisation experiment. It will generate numRep/2 of each a control and treatment condition. We will also simulate relocalisations for numDyn proteins.

set.seed(1)
tansim <- sim_dynamic(object = tan2009r1,
                      numRep = 6,
                      numDyn = 100)

The list of the 6 simulated experiments are found in tansim$lopitrep. Each one is an MSnSet instance (the standard data container for proteomics experimental data). The first 3 are the simulated control experiments (see tansim$lopitrep[1:3]), and the following 3 in the list are the treatment condition simulated experiments (see tansim$lopitrep[4:6]). We can plot them using the plot2D function from pRoloc.

plot_title <- c(paste0("Replicate ", seq(3), " condition", " A"), 
               paste0("Replicate ", seq(3), " condition", " B"))

par(mfrow = c(2, 3))
out <- lapply(seq(tansim$lopitrep), function(z) 
    plot2D(tansim$lopitrep[[z]], grid = FALSE, main = plot_title[z]))

For understanding, exploring and visualizing individual spatial proteomics experiments, see the vignettes in pRoloc and MSnbase packages.

tansim$lopitrep[[1]]
## MSnSet (storageMode: lockedEnvironment)
## assayData: 888 features, 4 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: X114 X115 X116 X117
##   varLabels: Fractions
##   varMetadata: labelDescription
## featureData
##   featureNames: P20353 P53501 ... P07909 (888 total)
##   fvarLabels: FBgn Protein.ID ... knn.scores (18 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
##   pubMedIds: 19317464 
## Annotation:  
## - - - Processing information - - -
## Added markers from  'mrk' marker vector. Thu Jul 16 22:53:44 2015 
## Performed knn prediction (k=10) Tue Dec 14 23:52:16 2021 
##  MSnbase version: 1.17.12

3 Preparing for bandle analysis

The main function of the package is bandle, this uses a complex model to analyse the data. Markov-Chain Monte-Carlo (MCMC) is used to sample the posterior distribution of parameters and latent variables. From which statistics of interest can be computed. Here we only run a few iterations for brevity but typically one needs to run thousands of iterations to ensure convergence, as well as multiple parallel chains.

3.1 Fitting Gaussian processes

First, we need to fit non-parametric regression functions to the markers profiles, upon which we place our analysis. This uses Gaussian processes. The fitGPmaternPC function can be used and fits some default penalised complexity priors (see ?fitGP), which work well. However, these can be altered, which is demonstrated in the next code chunk

par(mfrow = c(3,4))
gpParams <- lapply(tansim$lopitrep, function(x) fitGPmaternPC(x))

We apply the fitGPmaternPC function to each datasets by calling lapply over the tansim$lopitrep list of datasets. The output of fitGPmaternPC returns a list of posterior predictive means and standard deviations. As well as MAP hyperparamters for the GP. As a side effect a profile plot is produced for each class in each replicate condition where the posterior predictive distributions are overlayed with markers protein profiles.

The prior needs to form a K*3 matrix. K corresponds to the number of subcellular classes in the data, and 3 columns for (1) the prior, (2) length-scale amplitude and (3) standard deviation parameters (see hyppar in ?fitGP). Increasing these values, increases the shrinkage. For more details see the manuscript by Crook et al. (2021).

K <- length(getMarkerClasses(tansim$lopitrep[[1]], fcol = "markers"))
pc_prior <- matrix(NA, ncol = 3, K)
pc_prior[seq.int(1:K), ] <- matrix(rep(c(10, 60, 250),
                                       each = K), ncol = 3)

Now we have generated these complexity priors we can pass them as an argument to the fitGPmaternPC function. For example,

par(mfrow = c(3,4))
gpParams <- lapply(tansim$lopitrep,
                   function(x) fitGPmaternPC(x, hyppar = pc_prior))

By looking at the plot of posterior predictives we can see the GP fit looks sensible.

3.2 Setting the prior on the weights

The next step is to set up the matrix Dirichlet prior on the mixing weights. These weights are defined across datasets so these are slightly different to mixture weights in usual mixture models. The \((i,j)^{th}\) entry is the prior probability that a protein localises to organelle \(i\) in the control and \(j\) in the treatment. This mean that off-diagonal terms have a different interpretation to diagonal terms. Since we expect re-localisation to be rare, off-diagonal terms should be small. The following functions help set up the priors and how to interpret them. The parameter q allow us to check the prior probability that more than q differential localisations are expected.

set.seed(1)
dirPrior = diag(rep(1, K)) + matrix(0.001, nrow = K, ncol = K)
predDirPrior <- prior_pred_dir(object = tansim$lopitrep[[1]],
                               dirPrior = dirPrior,
                               q = 15)

The mean number of re-localisations is small:

predDirPrior$meannotAlloc
## [1] 0.3515246

The prior probability that more than q differential localisations are expected is small

predDirPrior$tailnotAlloc
## [1] 6e-04

The full prior predictive can be visualised as histogram. The prior probability that proteins are allocated to different components between datasets concentrates around 0.

hist(predDirPrior$priornotAlloc, col = getStockcol()[1])

If the the distribution did not concentrated around zero, as in most use-cases we would expect the number of re-localisations to be small, users could try testing smaller values in dirPrior for example, 0.0005 or smaller, instead of 0.001.

4 Running the bandle function

We are now ready to run the main bandle function. Remember to carefully select the datasets and replicates that define the control and treatment. As a reminder, in this “Getting started” vignette we have used a small dataset and generated theoretical triplicates of each theoretical condition. Please see the second vignette in this package for a more detailed workflow and real biological use-case. In the below code chunk we run bandle for only 50 iterations for the convenience of building the vignette, but typically we’d recommend you run the number of iterations (numIter) in the \(1000\)s.

Remember: the first 3 datasets are the first 3 elements of tansim and the final 3 datasets are the treatment triplicates.

control <- tansim$lopitrep[1:3] 
treatment <- tansim$lopitrep[4:6]

bandleres <- bandle(objectCond1 = control,
                    objectCond2 = treatment,
                    numIter = 50, # usually 10,000
                    burnin = 5, # usually 5,000
                    thin = 1, # usually 20
                    gpParams = gpParams,
                    pcPrior = pc_prior,
                    numChains = 1, # usually >=4
                    dirPrior = dirPrior)

The bandle function generates an object of class bandleParams. The show method indicates the number of parallel chains that were run, this should typically be greater than 4 (here we use 1 just as a demo).

bandleres
## Object of class "bandleParams"
## Method: bandle 
## Number of chains: 1

5 Analysing bandle output

Before we can begin to extract protein allocation information and a list of proteins which are differentially localised between conditions, we first need to populate the bandleres object by calling the bandleProcess function.

5.1 Populating a bandleres object

Currently, the summary slots of the bandleres object are empty. The summaries function accesses them.

summaries(bandleres)
## [[1]]
## An object of class "bandleSummary"
## Slot "posteriorEstimates":
## DataFrame with 0 rows and 0 columns
## 
## Slot "diagnostics":
## <0 x 0 matrix>
## 
## Slot "bandle.joint":
## <0 x 0 matrix>
## 
## 
## [[2]]
## An object of class "bandleSummary"
## Slot "posteriorEstimates":
## DataFrame with 0 rows and 0 columns
## 
## Slot "diagnostics":
## <0 x 0 matrix>
## 
## Slot "bandle.joint":
## <0 x 0 matrix>

These can be populated as follows

bandleres <- bandleProcess(bandleres)

These slot have now been populated

summary(summaries(bandleres))
##      Length Class         Mode
## [1,] 1      bandleSummary S4  
## [2,] 1      bandleSummary S4

5.1.1 bandle results

We can save the results by calling summaries. We see that it is of length 2. 1 for control and 1 for treatment.

res <- summaries(bandleres)
length(res)
## [1] 2

There are a number of slots,

str(res[[1]])
## Formal class 'bandleSummary' [package "bandle"] with 3 slots
##   ..@ posteriorEstimates:Formal class 'DFrame' [package "S4Vectors"] with 6 slots
##   .. .. ..@ rownames       : chr [1:677] "P20353" "P53501" "Q7KU78" "P04412" ...
##   .. .. ..@ nrows          : int 677
##   .. .. ..@ listData       :List of 7
##   .. .. .. ..$ bandle.allocation               : Named chr [1:677] "PM" "Golgi" "Ribosome 40S" "PM" ...
##   .. .. .. .. ..- attr(*, "names")= chr [1:677] "P20353" "P53501" "Q7KU78" "P04412" ...
##   .. .. .. ..$ bandle.probability              : Named num [1:677] 1 0.994 1 0.968 0.996 ...
##   .. .. .. .. ..- attr(*, "names")= chr [1:677] "P20353" "P53501" "Q7KU78" "P04412" ...
##   .. .. .. ..$ bandle.outlier                  : num [1:677] 0 0 0 0 0 0 0 0 0 0 ...
##   .. .. .. ..$ bandle.probability.lowerquantile: Named num [1:677] 1 0.979 1 0.896 0.983 ...
##   .. .. .. .. ..- attr(*, "names")= chr [1:677] "P20353" "P53501" "Q7KU78" "P04412" ...
##   .. .. .. ..$ bandle.probability.upperquantile: Named num [1:677] 1 0.999 1 1 1 ...
##   .. .. .. .. ..- attr(*, "names")= chr [1:677] "P20353" "P53501" "Q7KU78" "P04412" ...
##   .. .. .. ..$ bandle.mean.shannon             : Named num [1:677] 0.00 0.00 5.74e-08 0.00 7.71e-04 ...
##   .. .. .. .. ..- attr(*, "names")= chr [1:677] "P20353" "P53501" "Q7KU78" "P04412" ...
##   .. .. .. ..$ bandle.differential.localisation: Named num [1:677] 0 0 0 0.0667 0 ...
##   .. .. .. .. ..- attr(*, "names")= chr [1:677] "P20353" "P53501" "Q7KU78" "P04412" ...
##   .. .. ..@ elementType    : chr "ANY"
##   .. .. ..@ elementMetadata: NULL
##   .. .. ..@ metadata       : list()
##   ..@ diagnostics       : logi [1, 1] NA
##   ..@ bandle.joint      : num [1:677, 1:11] 1.27e-100 1.45e-140 1.10e-17 8.40e-136 4.09e-56 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:677] "P20353" "P53501" "Q7KU78" "P04412" ...
##   .. .. ..$ : chr [1:11] "Cytoskeleton" "ER" "Golgi" "Lysosome" ...

The main one of interest is the posteriorEstimates slot,

res[[1]]@posteriorEstimates
## DataFrame with 677 rows and 7 columns
##        bandle.allocation bandle.probability bandle.outlier
##              <character>          <numeric>      <numeric>
## P20353                PM           0.999998              0
## P53501             Golgi           0.994177              0
## Q7KU78      Ribosome 40S           1.000000              0
## P04412                PM           0.968461              0
## Q7KJ73      Ribosome 60S           0.996389              0
## ...                  ...                ...            ...
## Q95TL8                ER           1.000000              0
## P25007        Proteasome           0.999999              0
## P41374      Cytoskeleton           0.999975              0
## Q8SZM1        Peroxisome           0.863698              0
## P07909           Nucleus           1.000000              0
##        bandle.probability.lowerquantile bandle.probability.upperquantile
##                               <numeric>                        <numeric>
## P20353                         0.999995                         1.000000
## P53501                         0.979071                         0.999196
## Q7KU78                         1.000000                         1.000000
## P04412                         0.895974                         1.000000
## Q7KJ73                         0.982639                         1.000000
## ...                                 ...                              ...
## Q95TL8                         1.000000                         1.000000
## P25007                         0.999994                         1.000000
## P41374                         0.999924                         0.999998
## Q8SZM1                         0.526096                         0.992823
## P07909                         1.000000                         1.000000
##        bandle.mean.shannon bandle.differential.localisation
##                  <numeric>                        <numeric>
## P20353         0.00000e+00                        0.0000000
## P53501         0.00000e+00                        0.0000000
## Q7KU78         5.73779e-08                        0.0000000
## P04412         0.00000e+00                        0.0666667
## Q7KJ73         7.71039e-04                        0.0000000
## ...                    ...                              ...
## Q95TL8         0.00000e+00                         0.000000
## P25007         3.95029e-07                         0.000000
## P41374         0.00000e+00                         0.000000
## Q8SZM1         7.38636e-03                         0.866667
## P07909         2.28152e-07                         0.000000

This output object is a data.frame containing the protein allocations and associated localisation probabilities (including the upper and lower quantiles of the allocation probability distribution), the mean Shannon entropy and the bandle.differential.localisation probability.

5.2 Extracting posteriors and allocation results

We create two new objects pe1 and pe2 in the below code chunk which contain the output of the posteriorEstimates slot.

pe1 <- res[[1]]@posteriorEstimates
pe2 <- res[[2]]@posteriorEstimates

One quantity of interest is the protein allocations, which we can plot as a barplot.

alloc1 <- pe1$bandle.allocation
alloc2 <- pe2$bandle.allocation

par(mfrow = c(1, 2), oma = c(6,2,2,2))
barplot(table(alloc1), col = getStockcol()[2],
        las = 2, main = "Protein allocation: control")
barplot(table(alloc2), col = getStockcol()[2],
        las = 2, main = "Protein allocation: treatment")

The barplot tells us for this example that bandle has allocated the majority of unlabelled proteins to the ER, followed by the Golgi (irrespective of the posterior probability).

The associated posterior estimates are located in the bandle.probability column.

pe_alloc1 <- pe1$bandle.probability
pe_alloc2 <- pe1$bandle.probability

5.3 Allocation probabilities

The full allocation probabilities are stored in the tagm.joint slot. These can be visualised in a heatmap

bjoint_control <- summaries(bandleres)[[1]]@bandle.joint
pheatmap(bjoint_control, cluster_cols = FALSE, color = viridis(n = 25))

bjoint_treatment <- summaries(bandleres)[[2]]@bandle.joint
pheatmap(bjoint_treatment, cluster_cols = FALSE, color = viridis(n = 25))

5.4 Predicting the subcellular location

We can append the results to our original MSnSet datasets using the bandlePredict function.

xx <- bandlePredict(control, 
                    treatment, 
                    params = bandleres, 
                    fcol = "markers")
res_control <- xx[[1]]
res_treatment <- xx[[2]]

The output is a list of MSnSets. In this example, we have 3 for the control and 3 for the treatment.

length(res_control)
## [1] 3
length(res_treatment)
## [1] 3

The results are appended to the first MSnSet feature data slot for each condition.

fvarLabels(res_control[[1]])
##  [1] "FBgn"                             "Protein.ID"                      
##  [3] "Flybase.Symbol"                   "AccessionNo"                     
##  [5] "EntryName"                        "AccessionNoAll"                  
##  [7] "EntryNameAll"                     "No.peptide.IDs"                  
##  [9] "Mascot.score"                     "No.peptide.quantified"           
## [11] "PLSDA"                            "pd.2013"                         
## [13] "pd.markers"                       "markers.orig"                    
## [15] "markers"                          "markers.tl"                      
## [17] "knn"                              "knn.scores"                      
## [19] "bandle.allocation"                "bandle.probability"              
## [21] "bandle.probability.lowerquantile" "bandle.probability.upperquantile"
## [23] "bandle.mean.shannon"              "bandle.differential.localisation"
## [25] "bandle.outlier"                   "bandle.joint"

To access them use the fData function

fData(res_control[[1]])$bandle.probability
fData(res_control[[1]])$bandle.allocation

5.4.1 Thresholding on protein allocations

It is common practice in supervised machine learning to set a specific threshold on which to define new assignments/allocations, below which classifications are left unassigned/unknown. Indeed, we do not expect the whole subcellular diversity to be represented by the 11 niches defined here, we expect there to be many more, many of which will be multiply localised within the cell. It is important to allow for the possibility of proteins to reside in multiple locations (this information is available in the bandle.joint slot - see above for more details on multiple location).

As we are using a Bayesian model the outputs of the classifier are probabilities. This not only allows us to look at the distribution of probabilities over all subcellular classes but also allows us to extract a probability threshold on which we can define new assignments.

The subcellular allocations are located in the bandle.allocation column of the fData slot and the posteriors are located in the bandle.probability slot. We can use the getPredictions function from the pRoloc package to return a set of predicted localisations according to if they meet a probability threshold.

For example, in the below code chunk we set a 1% FDR for assigning proteins a subcellular nice, below which we leave them unlabelled.

res_control[[1]] <- getPredictions(res_control[[1]], 
                                   fcol = "bandle.allocation",                   
                                   scol = "bandle.probability",                   
                                   mcol = "markers",                   
                                   t = .99)
## ans
##  Cytoskeleton            ER         Golgi      Lysosome mitochondrion 
##            12           220           133             9            84 
##       Nucleus    Peroxisome            PM    Proteasome  Ribosome 40S 
##            40             7           134            45            74 
##  Ribosome 60S       unknown 
##            50            80
res_treatment[[1]] <- getPredictions(res_treatment[[1]], 
                                   fcol = "bandle.allocation",                   
                                   scol = "bandle.probability",                   
                                   mcol = "markers",                   
                                   t = .99)
## ans
##  Cytoskeleton            ER         Golgi      Lysosome mitochondrion 
##            17           192           121            15            84 
##       Nucleus    Peroxisome            PM    Proteasome  Ribosome 40S 
##            48            14           137            50            71 
##  Ribosome 60S       unknown 
##            49            90

We may also wish to take into account the probability of the protein being an outlier and thus use the results in the bandle.outlier column of the feature data. We could calculate the product of the posterior and the outlier (as they are both probabilities) to obtain a localisation score that takes into account the outlier model. More details on this are found in the second vignette of this package.

5.5 Differential localisation probability

As previously mentioned the term “differentially localised” is used to pertain to proteins which are assigned to different sub-cellular localisations between two conditions. For the majority of users this is the main output they are keen to extract using the BANDLE method.

Following on from the above example, after extracting posterior estimates for each condition using the summaries function we can also access the differential localisation probability as it is stored in the bandle.differential.localisation column of the data.frames of pe1 and pe2, in the above sections.

The differential localisation probability tells us which proteins are most likely to differentially localise. If we take a 1% FDR and examine how many proteins get a differential probability greater than 0.99 we find there are 74 proteins above this threshold.

diffloc_probs <- pe1$bandle.differential.localisation
head(diffloc_probs, 50)
##     P20353     P53501     Q7KU78     P04412     Q7KJ73     Q9VM65     Q9VCK0 
## 0.00000000 0.00000000 0.00000000 0.06666667 0.00000000 0.00000000 0.00000000 
##     B7Z0W3     Q9V415     Q00174     Q9V769     Q27593     Q9V780     P19109 
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 
##     Q95NR4     O77047     Q8SXD0     Q9VBV5     Q9VBU5     Q8IA62     Q9Y105 
## 0.00000000 0.04444444 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 
##     Q9W2M4     P26308     Q9VP77     Q5U0Z2     Q960E2     Q9VLJ6     Q9VJ39 
## 0.00000000 1.00000000 1.00000000 0.00000000 0.06666667 1.00000000 0.95555556 
##     Q9VTX8     Q9VTZ5     B7Z0C1     Q9VRJ4     M9PCB7     P46150     A1ZBH5 
## 0.00000000 0.35555556 0.00000000 0.00000000 0.00000000 0.06666667 0.00000000 
##     Q9W3M7     A8DZ29     Q9VN14     Q9VZL3     M9PC99     Q86BP3     Q9W3N9 
## 0.00000000 0.44444444 0.11111111 0.00000000 0.02222222 0.00000000 0.00000000 
##     Q7K5M6     P16620     P48375     Q9VMB9     Q9VI55     Q9VU58     Q9V498 
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.22222222 0.00000000 
##     Q9V496 
## 1.00000000
length(which(diffloc_probs[order(diffloc_probs, decreasing = TRUE)] > 0.99))
## [1] 76

This can also be seen on a rank plot

plot(diffloc_probs[order(diffloc_probs, decreasing = TRUE)],
     col = getStockcol()[3], pch = 19, ylab = "Probability",
     xlab = "Rank", main = "Differential localisation rank plot")

In-line with our expectations, the rank plot indicates that most proteins are not differentially localised.

5.5.1 Estimating uncertainty

5.5.1.1 Applying the bootstrapdiffLocprob function

We can examine the top n proteins (here we use an example of top = 100) and produce bootstrap estimates of the uncertainty (note here the uncertainty is likely to be underestimated as we did not produce many MCMC samples). These can be visualised as ranked boxplots.

set.seed(1)
boot_t <- bootstrapdiffLocprob(params = bandleres, top = 100,
                               Bootsample = 5000, decreasing = TRUE)

boxplot(t(boot_t), col = getStockcol()[5],
        las = 2, ylab = "Probability", ylim = c(0, 1),
        main = "Differential localisation \nprobability plot (top 100 proteins)")

5.5.1.2 Applying the binomDiffLoc function

Instead of applying the bootstrapdiffLocprob we could use the binomDiffLoc function to obtain credible intervals from the binomial distribution.

bin_t <- binomialDiffLocProb(params = bandleres, top = 100,
                             nsample = 5000, decreasing = TRUE)

boxplot(t(bin_t), col = getStockcol()[5],
        las = 2, ylab = "Probability", ylim = c(0, 1),
        main = "Differential localisation \nprobability plot (top 100 proteins)")

5.5.1.3 Obtaining probability estimates

There are many ways we could obtain probability estimates from either of the above methods. We could, for example, take the mean of each protein estimate, or compute the cumalative error (there is not really a false discovery rate in Bayesian statistics) or we could threshold on the interal to reduce the number of differential localisations if you feel the model has been overconfident.

# more robust estimate of probabilities
dprobs <- rowMeans(bin_t)

# compute cumalative error, there is not really a false discovery rate in
# bayesian statistics but you can look at the cummatlive error rate
ce <- cumsum(1  - dprobs)

# you could threshold on the interval and this will reduce the number of
# differential localisations
qt <- apply(bin_t, 1, function(x) quantile(x, .025))

6 Visualising differential localisation

We can visualise the changes in localisation between conditions on an alluvial plot using the plotTranslocations function

plotTranslocations(bandleres)

By default, irrespective of probability, the predicted allocation is taken from

Or alternatively, on a chord (circos) diagram

plotTranslocations(bandleres, type = "chord")

Lastly, we can also pass the argument table = TRUE to the plotTranslocations function to display a summary table of the number of proteins that have changed in location between conditions

(sum.res <- plotTranslocations(bandleres, table = TRUE))
##        Condition1    Condition2 value
## 11             ER  Cytoskeleton     2
## 12             ER         Golgi     1
## 13             ER      Lysosome     6
## 14             ER mitochondrion     1
## 15             ER       Nucleus     4
## 16             ER    Peroxisome     7
## 17             ER            PM     4
## 18             ER    Proteasome     5
## 19             ER  Ribosome 40S     1
## 22          Golgi            ER     1
## 24          Golgi mitochondrion     5
## 25          Golgi       Nucleus     3
## 27          Golgi            PM     1
## 29          Golgi  Ribosome 40S     1
## 30          Golgi  Ribosome 60S     4
## 41  mitochondrion  Cytoskeleton     2
## 43  mitochondrion         Golgi     2
## 45  mitochondrion       Nucleus     1
## 56        Nucleus    Peroxisome     1
## 63     Peroxisome         Golgi     5
## 71             PM  Cytoskeleton     1
## 72             PM            ER     1
## 73             PM         Golgi     3
## 78             PM    Proteasome     6
## 82     Proteasome            ER     1
## 85     Proteasome mitochondrion     1
## 86     Proteasome       Nucleus     1
## 88     Proteasome            PM     3
## 89     Proteasome  Ribosome 40S     1
## 90     Proteasome  Ribosome 60S     3
## 91   Ribosome 40S  Cytoskeleton     1
## 93   Ribosome 40S         Golgi     5
## 95   Ribosome 40S mitochondrion     2
## 96   Ribosome 40S       Nucleus     1
## 99   Ribosome 40S    Proteasome     1
## 100  Ribosome 40S  Ribosome 60S     3
## 109  Ribosome 60S    Proteasome     4
## 110  Ribosome 60S  Ribosome 40S     1

6.0.1 Important consideration

These visualisations are showing the change in class label between the two conditions (as assigned by bandle i.e. the result stored in bandle.allocation). The results are taken directly from bandleres and thus no thresholding on the class label and posterior to allow for proteins to be left “unassigned” or unknown, is conducted. Furthermore, there is not thresholding on the bandle.differential.localisation probability.

It would be better to re-plot these figures to get a better representation of what is moving. The easiest way to do this is to pass the MSnSets output after performing bandlePredict and getPredictions.

For example, first let’s identify which proteins get a high differential localisation probability,

ind <- which(fData(res_control[[1]])$bandle.differential.localisation > 0.99)
res_dl_control <- res_control[[1]][ind, ]
res_dl_treatment <- res_treatment[[1]][ind, ]

Now we can plot only these and also specify where the prediction results are located,

mycols <- c(getStockcol()[seq(getMarkerClasses(res_control[[1]]))], "grey")
names(mycols) <- c(getMarkerClasses(res_control[[1]]), "unknown")

plotTranslocations(list(res_dl_control, res_dl_treatment), 
                   fcol = "bandle.allocation.pred", col = mycols)
## 76 features in common
## ------------------------------------------------
## If length(fcol) == 1 it is assumed that the
## same fcol is to be used for both datasets
## setting fcol = c(bandle.allocation.pred,bandle.allocation.pred)
## ----------------------------------------------

(final.res <- plotTranslocations(params = list(res_dl_control, res_dl_treatment), 
                                 fcol = "bandle.allocation.pred", table = TRUE))
##        Condition1    Condition2 value
## 1              ER         Golgi     1
## 2              ER mitochondrion     1
## 3              ER       Nucleus     4
## 4              ER    Peroxisome     7
## 5              ER            PM     4
## 6              ER    Proteasome     4
## 7              ER  Ribosome 40S     1
## 9              ER       unknown     1
## 10             ER  Cytoskeleton     1
## 11             ER      Lysosome     6
## 12          Golgi            ER     1
## 13          Golgi mitochondrion     3
## 14          Golgi       Nucleus     2
## 18          Golgi  Ribosome 40S     1
## 20          Golgi       unknown     2
## 25  mitochondrion       Nucleus     1
## 31  mitochondrion       unknown     1
## 32  mitochondrion  Cytoskeleton     1
## 37        Nucleus    Peroxisome     1
## 46     Peroxisome         Golgi     2
## 56             PM            ER     1
## 57             PM         Golgi     1
## 61             PM    Proteasome     5
## 64             PM       unknown     1
## 67     Proteasome            ER     1
## 69     Proteasome mitochondrion     1
## 70     Proteasome       Nucleus     1
## 72     Proteasome            PM     2
## 73     Proteasome  Ribosome 40S     1
## 74     Proteasome  Ribosome 60S     2
## 75     Proteasome       unknown     1
## 79   Ribosome 40S         Golgi     2
## 80   Ribosome 40S mitochondrion     1
## 84   Ribosome 40S    Proteasome     1
## 85   Ribosome 40S  Ribosome 60S     2
## 87   Ribosome 40S  Cytoskeleton     1
## 95   Ribosome 60S    Proteasome     2
## 103       unknown       Nucleus     1
## 105       unknown            PM     1
## 106       unknown    Proteasome     2
## 109       unknown  Cytoskeleton     1

7 Session information

All software and respective versions used to produce this document are listed below.

sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] viridis_0.6.2        viridisLite_0.4.0    pheatmap_1.0.12     
##  [4] pRolocdata_1.32.0    ggplot2_3.3.5        bandle_1.0          
##  [7] pRoloc_1.34.0        BiocParallel_1.28.3  MLInterfaces_1.74.0 
## [10] cluster_2.1.2        annotate_1.72.0      XML_3.99-0.8        
## [13] AnnotationDbi_1.56.2 IRanges_2.28.0       MSnbase_2.20.1      
## [16] ProtGenerics_1.26.0  mzR_2.28.0           Rcpp_1.0.7          
## [19] Biobase_2.54.0       S4Vectors_0.32.3     BiocGenerics_0.40.0 
## [22] BiocStyle_2.22.0    
## 
## loaded via a namespace (and not attached):
##   [1] circlize_0.4.13        BiocFileCache_2.2.0    plyr_1.8.6            
##   [4] splines_4.1.2          listenv_0.8.0          GenomeInfoDb_1.30.0   
##   [7] digest_0.6.29          foreach_1.5.1          htmltools_0.5.2       
##  [10] magick_2.7.3           gdata_2.18.0           ggalluvial_0.12.3     
##  [13] fansi_0.5.0            magrittr_2.0.1         memoise_2.0.1         
##  [16] doParallel_1.0.16      mixtools_1.2.0         limma_3.50.0          
##  [19] recipes_0.1.17         globals_0.14.0         Biostrings_2.62.0     
##  [22] gower_0.2.2            lpSolve_5.6.15         prettyunits_1.1.1     
##  [25] colorspace_2.0-2       ggrepel_0.9.1          blob_1.2.2            
##  [28] rappdirs_0.3.3         xfun_0.28              dplyr_1.0.7           
##  [31] crayon_1.4.2           RCurl_1.98-1.5         jsonlite_1.7.2        
##  [34] hexbin_1.28.2          impute_1.68.0          survival_3.2-13       
##  [37] iterators_1.0.13       glue_1.5.1             gtable_0.3.0          
##  [40] ipred_0.9-12           zlibbioc_1.40.0        XVector_0.34.0        
##  [43] kernlab_0.9-29         shape_1.4.6            future.apply_1.8.1    
##  [46] scales_1.1.1           vsn_3.62.0             mvtnorm_1.1-3         
##  [49] DBI_1.1.1              xtable_1.8-4           progress_1.2.2        
##  [52] clue_0.3-60            bit_4.0.4              proxy_0.4-26          
##  [55] mclust_5.4.8           preprocessCore_1.56.0  lbfgs_1.2.1           
##  [58] MsCoreUtils_1.6.0      lava_1.6.10            prodlim_2019.11.13    
##  [61] sampling_2.9           httr_1.4.2             FNN_1.1.3             
##  [64] RColorBrewer_1.1-2     ellipsis_0.3.2         farver_2.1.0          
##  [67] pkgconfig_2.0.3        nnet_7.3-16            sass_0.4.0            
##  [70] dbplyr_2.1.1           utf8_1.2.2             caret_6.0-90          
##  [73] reshape2_1.4.4         tidyselect_1.1.1       rlang_0.4.12          
##  [76] munsell_0.5.0          tools_4.1.2            LaplacesDemon_16.1.6  
##  [79] cachem_1.0.6           generics_0.1.1         RSQLite_2.2.9         
##  [82] evaluate_0.14          stringr_1.4.0          fastmap_1.1.0         
##  [85] mzID_1.32.0            yaml_2.2.1             ModelMetrics_1.2.2.2  
##  [88] knitr_1.36             bit64_4.0.5            randomForest_4.6-14   
##  [91] purrr_0.3.4            KEGGREST_1.34.0        dendextend_1.15.2     
##  [94] ncdf4_1.18             future_1.23.0          nlme_3.1-153          
##  [97] xml2_1.3.3             biomaRt_2.50.1         compiler_4.1.2        
## [100] filelock_1.0.2         curl_4.3.2             png_0.1-7             
## [103] e1071_1.7-9            affyio_1.64.0          tibble_3.1.6          
## [106] bslib_0.3.1            stringi_1.7.6          highr_0.9             
## [109] lattice_0.20-45        Matrix_1.4-0           vctrs_0.3.8           
## [112] pillar_1.6.4           lifecycle_1.0.1        BiocManager_1.30.16   
## [115] GlobalOptions_0.1.2    jquerylib_0.1.4        MALDIquant_1.20       
## [118] data.table_1.14.2      bitops_1.0-7           R6_2.5.1              
## [121] pcaMethods_1.86.0      affy_1.72.0            bookdown_0.24         
## [124] gridExtra_2.3          parallelly_1.29.0      codetools_0.2-18      
## [127] gtools_3.9.2           MASS_7.3-54            assertthat_0.2.1      
## [130] withr_2.4.3            GenomeInfoDbData_1.2.7 parallel_4.1.2        
## [133] hms_1.1.1              grid_4.1.2             rpart_4.1-15          
## [136] timeDate_3043.102      tidyr_1.1.4            coda_0.19-4           
## [139] class_7.3-19           rmarkdown_2.11         segmented_1.3-4       
## [142] pROC_1.18.0            lubridate_1.8.0

References

Crook, Oliver M, Colin TR Davies, Laurent Gatto, Paul DW Kirk, and Kathryn S Lilley. 2021. “Inferring Differential Subcellular Localisation in Comparative Spatial Proteomics Using BANDLE.” bioRxiv.
Gatto, Laurent, Lisa M. Breckels, Samuel Wieczorek, Thomas Burger, and Kathryn S. Lilley. 2014. “Mass-Spectrometry Based Spatial Proteomics Data Analysis Using pRoloc and pRolocdata.” Bioinformatics.
Gentleman, Robert C., Vincent J. Carey, Douglas M. Bates, Ben Bolstad, Marcel Dettling, Sandrine Dudoit, Byron Ellis, et al. 2004. “Bioconductor: Open Software Development for Computational Biology and Bioinformatics.” Genome Biol 5 (10): –80. https://doi.org/10.1186/gb-2004-5-10-r80.
Gilks, Walter R, Sylvia Richardson, and David Spiegelhalter. 1995. Markov Chain Monte Carlo in Practice. CRC press.
R Development Core Team. 2011. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. http://www.R-project.org/.
Tan, Denise JL, Heidi Dvinge, Andrew Christoforou, Paul Bertone, Alfonso Martinez Arias, and Kathryn S Lilley. 2009. “Mapping Organelle Proteins and Protein Complexes in Drosophila Melanogaster.” Journal of Proteome Research 8 (6): 2667–78.